Chris Pollett > Old Classses >
CS267
( Print View )

Student Corner:
  [Grades Sec1]

  [Submit Sec1]

  [Lecture Notes]
  [Discussion Board]

Course Info:
  [Texts & Links]
  [Description]
  [Course Outcomes]
  [Outcomes Matrix]
  [Course Schedule]
  [Grading]
  [Requirements/HW/Quizzes]
  [Class Protocols]
  [Exam Info]
  [Regrades]
  [University Policies]
  [Announcements]

HW Assignments:
  [Hw1] [Hw2] [Hw3]
  [Hw4] [Hw5] [Quizzes]

Practice Exams:
  [Midterm] [Final]

HW#5 --- last modified February 05 2019 20:16:20..

Solution set.

Due date: May 13

Files to be submitted:
Hw5.zip

Purpose: To gain experience with index compression methods and a variety of relevance measure. To learn about map reduce algorithms.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

CLO4 -- Give an example of how a posting list might be compressed using difference lists and gamma codes or Rice codes.

CLO5 -- Demonstrate with small examples how incremental index updates can be done with log merging.

CLO7 -- Know at least one Map Reduce algorithm (for example to calculate page rank).

Specification:

Do the following problems and submit them in Problems.pdf which you should include in your Hw5.zip folder:

Do problem 7.2 and 7.3 out of the book with the following modifications. For 7.2, after answering the question determine what would the bound be if `k=3` rather than `k=2`? For 7.3, how big is the constant over the non-anticipatory version of logarithmic merging?
In a plain text ASCII document, define a paragraph to be any block of text preceeded by a start of document or two blanks line and continuing until an end of document or two blank lines. Define a sentence to be a substring of a paragraph beginning with the start of a paragraph or the character after a period and ending with an end of paragraph or a period. Write a map reduce algorithm for computing the average sentence length for a corpus of plain text documents.
Write a map reduce algorithm which computes the conditional probability of a term `t` occurring in a sentence given that the sentence has a term `t'` for a corpus of plain text documents. i.e., p(t in sentence |t' in sentence).

For the coding part of the homework I would like you to split the search_program.php that we have been developing this semester into two programs. index_program.php and query_program.php. The first would be run from the command line with a syntax like:

php index_program.php path_to_folder_to_index index_filename

Here path_to_folder_to_index should be the path to a folder filled with plain text documents to index. These documents will be numbered 00.txt, 01.txt,..., 10.txt, 11.txt, 12.txt, ...etc. You can assume the folder has at most 100 files. You can treat the file name excluding the ".txt" file extension as the doc id. Your program on input such as the above should then write an index to index_filename. For this index, the dictionary should only store stemmed versions of words, and should take a dictionary-as-string approach to its layout. Posting lists should be stored as gamma-code compressed delta lists of offsets into a document map. A document map entry should consist of document id, a map entry length, document length, sorted list of distinct stemmed terms in the document and their frequencies. Alternatively, (choose one or the other approach) a posting can be a gamma-code compresses delta list offset followed by a gamma-coded frequency, and the document map entries are just document id's followed by document lengths.

Your query program will be run from the command line like:

php query_program.php index_filename query relevance_measure

Here index_filename is the name of an index file that might have been produced by your index_program.php, query is some query to run against this index file, and relevance_measure is one of BM25 and your choice of DFR, LMJM, LMD. You should have a readme.txt file which besides listing your team members says which of these three relevance measures you chose. Your program on the above input should use the index and compute a conjunctive query of the terms in query and score the resulting documents using the provided relevance measure. It should then output these in a descending order in a format usable by trec_eval. You should include with your project a test subfolder, which should have plain text documents with names as described above. Using this test set do some experiments to compare the measure you chose against BM25 using trec_eval. Write up your results in experiments.pdf.

Point Breakdown

Problems (each problem 1pt - 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong)	4pts
index_program.php runs on the command line inputs as describes reading the provided directory and producing an inverted index output file.	1
dictionary in output file as described (stores stemmed terms using dictionary as a string) (0.5pts) and document map as described (0.5pts).	1
posting in output file as described and compressed using gamma codes.	1
query_program runs as described, reading the inverted index file only when computing the correct result documents	1
query_program ranks output according to provided relevance measure	1
experiment.pdf as described.	1
Total	10pts